Abstract

Speedup measures how much faster we can solve the same problem using many cores. If 
we can afford to keep the execution time fixed, then quality up measures how much better 
the solution will be computed using many cores. In this paper we describe our multithreaded 
implementation to track one solution path defined by a polynomial homotopy. Limiting quality 
to accuracy and confusing accuracy with precision, we strive to offset the cost of multiprecision 
arithmetic running multithreaded code on many cores. 



1 Introduction



Solving polynomial systems by homotopy continuation proceeds in two stages: we first define a 
family of systems (the homotopy) and then we track the solution paths defined by the homotopy. 
Tracking all paths is a pleasingly parallel computation. The problem we consider in this paper is 
to track one solution path. Tracking only one solution path could occur for huge problems
(for which it is no longer feasible to compute all solutions). The need to track one difficult
solution path for which multiprecision arithmetic is required also often arises for larger systems.

On a multicore workstation, we experimentally determined in [10] thresholds on the dimension
of the problem to achieve a good speedup for the components of Newton's method, using the
quad double software library QD-2.3.9 [5]. While polynomial evaluation often dominates the
computational cost, it also pays off to run multithreaded versions of the Gaussian elimination
stage of Newton's method. In this paper we describe our multithreaded path tracker.

*This material is based upon work supported by the National Science Foundation under Grant No. 0713018 and
Grant No. 1115777.

The idea to use floating-point arithmetic as implemented in QD-2.3.9 to increase the working
precision dates back to [3]. In [8], this idea is described as an error-free transformation because
with double doubles we can calculate the error of a floating-point operation. From a computational
complexity point of view, double doubles are attractive because the cost overhead is similar
to working with complex arithmetic. For multithreading (concurrent tasks accessing
shared memory), double doubles are very attractive because blocking memory allocations and
deallocations do not occur.
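
To make the notion of an error-free transformation concrete, the following is a minimal Python sketch of the classical two-sum operation and a simplified double double addition. It is an illustration only and not the QD-2.3.9 interface: the names two_sum and dd_add and the representation of a double double as a (hi, lo) pair are assumptions made for this example.

```python
from fractions import Fraction


def two_sum(a, b):
    """Return (s, e) with s the rounded sum of a and b, and e the rounding error.

    Barring overflow, s + e equals a + b exactly (an error-free transformation).
    """
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def dd_add(x, y):
    """Add two double doubles given as (hi, lo) pairs (hypothetical layout).

    This is a simplified variant: the low parts are added in plain double
    arithmetic before the result is renormalized.
    """
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)


s, e = two_sum(0.1, 0.2)
# Exact rational arithmetic confirms that no information was lost.
exact = Fraction(s) + Fraction(e) == Fraction(0.1) + Fraction(0.2)
small = dd_add((1.0, 0.0), (1e-20, 0.0))
```

Because a double double is a fixed-size pair of hardware doubles, such operations need no dynamic memory allocation, which is the property that matters for multithreading.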

As in [10] we continue the development of double double and quad double arithmetic in our 
path trackers and experimentally determine thresholds on the dimensions and degrees to achieve 
a good speedup. In the next section we relate our calculations with the less commonly used 
notion of quality up [1]. As double floating-point arithmetic has become the norm, we can ask
how much faster our hardware should become for double double arithmetic to become our default 
precision? We illustrate the computation of quality up factors in the next section. 

In [10] we experienced that for homotopy continuation methods, polynomial evaluation is the
dominating cost factor, exceeding the cost of the linear system solving required in Newton's
method, although it still pays off to multithread Gaussian elimination. Newton's method
requires not only the values of the polynomials but also all derivatives with respect to all
unknowns, which form the Jacobian matrix. Using ideas from algorithmic differentiation [4] we
have been able to reduce the dominating cost factor.

Another application area for the techniques of this paper is the deflation of isolated
singularities [7]. The deflation method accurately locates singular solutions at the expense of adding
higher derivatives to the original system, essentially doubling the dimension in every stage. Any
implementation of this deflation will benefit from increased precision and efficient evaluation of
polynomials and all their derivatives. In this setting the granularity of the parallelism must be
fine and the use of multithreading is needed.

Acknowledgements. The ideas for the quality up section below were developed while preparing 
for an invited talk of the second author at the workshop on Hybrid Methodologies for Symbolic- 
Numeric Computation, held at MSRI from 17 to 19 November 2010. The second author is grateful 
to the organizers of this MSRI workshop for their invitation. 

2 Speedup and Quality Up 

When using multiple cores, we commonly ask how much faster we can solve a problem when using
$p$ cores. Denoting by $T_p$ the time on $p$ cores, the speedup is defined as $T_1/T_p$, i.e., the
time on 1 core divided by the time on $p$ cores. In the optimal case, the speedup will converge
to $p$: with $p$ cores we can solve the same problem $p$ times faster.

In addition to speedup, we would like to know how much better we can solve the problem when
using $p$ cores. Our notion of quality up as an analogue to speedup is inspired by Selim Akl's
paper [1]. We define quality as the number of correct decimal places in the computed solution.
Denoting by $Q_p$ the quality obtained using $p$ cores, we define quality up as $Q_p/Q_1$, keeping
the time fixed. As with speedup, we could also hope for a quality up factor of $p$ in the optimal case.

Because the number of correct decimal places in a numerical solution depends on the sensitivity
of the solution to perturbations in the input and is bounded by condition numbers, we assume
our problems are well-conditioned, deliberately confusing working precision with accuracy. Taking
a narrow view on quality, we define

$$\text{quality up} = \frac{Q_p}{Q_1} = \frac{\#\ \text{decimal places with } p \text{ cores}}{\#\ \text{decimal places with 1 core}}.$$



Often multiprecision arithmetic is necessary to obtain meaningful answers and then we want to
know how many cores we need to compensate for the overhead caused by software driven
arithmetic. Using the quad double software library QD-2.3.9 [5], we experimentally determined in [10]
that the computational overhead of using double double arithmetic over hardware double
arithmetic on solving linear systems with LU factorization averaged around eight. This experimental
factor of eight is about the same as the overhead factor of using complex arithmetic compared to
real arithmetic. The number eight also equals the number of cores on our Mac OS X 3.2 GHz Intel
Xeon workstation.

To estimate the quality up factors, we assume an optimal (or constant) speedup. Moreover,
we assume that the ratio $Q_p/Q_1$ is linear in $p$ so we can apply linear extrapolation. To illustrate
the estimation of the quality up factor, consider the refinement of the 1,747 generating cyclic
10-roots. The cyclic 10-roots problem belongs to a well known family of benchmark polynomial
systems, see for instance [2], [6], or [9]. To compare quality up, we compare the 4.818 seconds
of real time with one core using double double complex arithmetic to the 8.076 seconds of real
time using quad double arithmetic using 8 cores. With 8 cores we double the accuracy in less
than double the time. As this refinement is a pleasingly parallel calculation, the assumption that
the speedup is optimal is natural. The concept of quality up requires a constant time, so we ask:
how many cores do we need to reduce the calculation with quad doubles to 4.818s?

$$\frac{8.076}{4.818} \times 8 = 13.410 \Rightarrow 14 \text{ cores}.$$

Denoting $y(p) = Q_p/Q_1$ and assuming $y(p)$ is linear in $p$, we have $y(1) = 1$ and $y(14) = 2$, so
we interpolate:

$$y(p) - y(1) = \frac{y(14) - y(1)}{14 - 1}\,(p - 1),$$

and the quality up factor is $y(8) = 1 + \frac{7}{13} \approx 1.538$. The interpretation for the factor
1.538 is as follows: in keeping the total time fixed, we can increase the working precision by about
50% using 8 cores.
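
The arithmetic of this estimate can be retraced with a few lines of Python. This is only a sketch of the calculation in this section, under the same assumptions (optimal speedup, quality linear in the number of cores); the function name quality_up and its parameters are chosen for this illustration.

```python
import math


def quality_up(time_low, time_high, p, precision_ratio=2.0):
    """Estimate the quality up factor on p cores by linear interpolation.

    time_low:  time on 1 core in the lower precision,
    time_high: time on p cores in the higher precision,
    precision_ratio: gain in precision (2 when going from double double
    to quad double arithmetic).
    """
    # Cores needed to bring the higher precision run down to time_low,
    # assuming an optimal speedup; this must exceed 1 for the slope below.
    cores_needed = math.ceil(time_high / time_low * p)
    # Line through (1, 1) and (cores_needed, precision_ratio).
    slope = (precision_ratio - 1.0) / (cores_needed - 1)
    return cores_needed, 1.0 + slope * (p - 1)


cores, factor = quality_up(4.818, 8.076, 8)
print(cores, round(factor, 3))  # 14 1.538
```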



3 Multithreaded Path Tracking 

Given a homotopy $h(x,t) = 0$ and a start solution at $t = 0$, the path tracker returns the solution
at the end of the path, at $t = 1$. A path tracker has three ingredients: a predictor for the next
value of $t$ and the extrapolated corresponding values for $x$; Newton's method as a corrector,
keeping the predicted value for $t$ fixed; and a step size control algorithm.
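
To fix ideas, here is a minimal sequential sketch in Python of how the three ingredients interact, for a homotopy in one real variable. It is not the tracker of this paper: the trivial predictor (reuse the current solution), the step size rules, and all names and default values are assumptions made for illustration.

```python
def track_path(h, dh_dx, x, step=0.1, tol=1e-12, max_it=6, min_step=1e-10):
    """Track the solution x of h(x, t) = 0 from t = 0 to t = 1.

    Returns the solution at t = 1, or None if the step size becomes too small.
    """
    t = 0.0
    while t < 1.0:
        # Predictor: next value of t, with the current x as initial guess.
        t_new = 1.0 if t + step >= 1.0 else t + step
        y = x
        # Corrector: Newton's method, keeping t_new fixed.
        success = False
        for _ in range(max_it):
            y = y - h(y, t_new) / dh_dx(y, t_new)
            if abs(h(y, t_new)) < tol:
                success = True
                break
        # Step size control: accept and enlarge, or reject and halve.
        if success:
            x, t = y, t_new
            step = min(2.0 * step, 0.25)
        else:
            step = step / 2.0
            if step < min_step:
                return None
    return x


def h(x, t):
    # Solution path x(t) = sqrt(1 + 3t), from x = 1 at t = 0 to x = 2 at t = 1.
    return x * x - (1.0 + 3.0 * t)


def dh_dx(x, t):
    return 2.0 * x


end = track_path(h, dh_dx, 1.0)
```

On a rejected step the tracker simply retries from the last accepted point with a smaller step, which plays the role of stepping back.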

Algorithm 3.1 provides pseudo code for the multithreaded version of a path tracker. Every
thread executes the same code. Variables that start with my_ are local to the thread. The first
thread manages the flags used for synchronization. Because threads are created once and remain
allocated to the path tracking process till the end, idle threads are not released to the operating
system, but run a busy waiting loop.



Algorithm 3.1 (Multithreaded Path Tracking).

Input: h(x, t) = 0, z.                                    homotopy and start solution
Output: z: h(z, 1) = 0 or fail.                           solution at end of path or failure

stop := false;                                            initializations
Δ := initial step size;                                   all other variables are set to 0
while (t < 1 and not stop) do
    while (corr_Ind < my_corr_Ind) wait;                  wait till previous post correction
    if (my_ID = 1) then                                   prediction done by thread 1
        predict(z, t, Δ);                                 new z and t
        pred_Ind := pred_Ind + 1;                         signal that prediction done
    end if;
    while (pred_Ind < corr_Ind + 1) wait;                 wait till prediction done
    Newton(my_ID, h, z, ε, Max_It, success);              run multithreaded Newton
    Newton_Ind[my_ID] := Newton_Ind[my_ID] + 1;           thread my_ID is done
    while (∃ ID: Newton_Ind[ID] < corr_Ind + 1) wait;     wait till correction terminates
    if (my_ID = 1) then                                   step size control by thread 1
        step_size_control(Δ, success);                    adjust step size
        step_back(z, t, success);                         step back if no success
        stop := stop_criterion(Δ, corr_Ind);              Δ too small or corr_Ind too large
        corr_Ind := corr_Ind + 1;                         step size control is done
    end if;
    my_corr_Ind := my_corr_Ind + 1;                       continue to next step in while
end while;
fail := not stop.                                         failure if stopped with t < 1



Prediction and step size control are relatively inexpensive operations and are performed en- 
tirely by the first thread. Newton's method is computationally more involved and is executed in 
a multithreaded fashion. 
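
The synchronization pattern (one managing thread, shared counters, busy waiting) can be imitated in a few lines of Python. The sketch below is an illustration under stated assumptions and not the pthreads code of this paper: threads are numbered from 0, the "prediction" and the "correction" are replaced by toy computations, and every wait compares a shared counter with the thread-local step number, which is a variation on the flags of the pseudo code. All names are hypothetical.

```python
import threading
import time


def run_lockstep(num_threads=4, num_steps=5, n=16):
    z = [float(i) for i in range(n)]   # shared vector, updated by thread 0 only
    partial = [0.0] * num_threads      # one result slot per thread
    history = []                       # one combined value per step
    pred_ind = [0]                     # raised by thread 0 after the prediction
    corr_ind = [0]                     # raised by thread 0 after the step control
    work_ind = [0] * num_threads       # raised by each thread after its share

    def wait_until(condition):
        while not condition():         # busy waiting loop
            time.sleep(0)              # yield, so that other threads can run

    def worker(my_id):
        for my_step in range(num_steps):
            wait_until(lambda: corr_ind[0] >= my_step)
            if my_id == 0:             # inexpensive stage, one thread only
                for i in range(n):
                    z[i] += 1.0
                pred_ind[0] = my_step + 1
            wait_until(lambda: pred_ind[0] > my_step)
            # Expensive stage: every thread handles its own part of z.
            partial[my_id] = sum(z[i] * z[i] for i in range(my_id, n, num_threads))
            work_ind[my_id] = my_step + 1
            wait_until(lambda: all(c > my_step for c in work_ind))
            if my_id == 0:             # combine the results, then release all
                history.append(sum(partial))
                corr_ind[0] = my_step + 1

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return history


history = run_lockstep()
```

Since the counters only increase and each thread compares them with its own step number, a fast thread can never overtake the managing thread by more than one stage.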



4 Multithreaded Newton Method 

In this section we focus on our multithreaded version of Newton's method using multithreaded
polynomial evaluation and linear system solving described in [10]. Following the same notational
conventions as in Algorithm 3.1, pseudo code is described in Algorithm 4.1 below.

Algorithm 4.1 (Multithreaded Newton's Method).

Input: h(x, t) = 0, z;                                    homotopy and initial solution
       ε, Max_It.                                         tolerance and maximal #iterations
Output: z: ||h(z, t)|| < ε or fail.                       corrected solution or failure

i := 0;                                                   count #iterations
||h(z, t)|| := 1;                                         initialize residual
while (||h(z, t)|| > ε) and (i < Max_It) do
    V := Monomial_Evaluation(my_ID, h, z);                multithreaded monomial evaluation
    Status_MonVal[my_ID] := 1;                            thread done with monomial evaluation
    if (my_ID = 1) then                                   flag adjustments for next stage
        while (∃ ID: Status_MonVal[ID] = 0) wait;         thread 1 waits
        for all ID do Status_MonVal[ID] := 0;             flags reset for next stage
        Mon_Ind := Mon_Ind + 1;                           update monomial counter
    end if;
    while (Mon_Ind < my_Iter + 1) wait;                   wait till all monomials are evaluated
    Y := Coefficient_Product(my_ID, V, h);                multiply monomials with coefficients
    Status_Coeff[my_ID] := 1;                             thread done with coefficient product
    if (my_ID = 1) then                                   flag adjustments for next stage
        while (∃ ID: Status_Coeff[ID] = 0) wait;          thread 1 waits
        for all ID do Status_Coeff[ID] := 0;              flags reset for next stage
        ||h(z, t)|| := Residual(Y);                       calculate residual
        Coeff_Ind := Coeff_Ind + 1;                       update coefficient counter
    end if;
    while (Coeff_Ind < my_Iter + 1) wait;                 wait till all polynomials are evaluated
    Ab := GE(my_ID, my_Iter, Y, pivots);                  row reduction on Jacobi matrix
    m := (n − 1)(my_Iter + 1);                            used for synchronization
    while (∃ ID: pivots[ID] < m) wait;                    wait for row reduction to finish
    Back_Subs(my_ID, my_Iter, Ab, Δz, BS_Ind);            multithreaded back substitution
    while (BS_Ind < my_Iter + 1) wait;                    wait till back substitution done
    if (my_ID = 1) then
        z := z + Δz; i := i + 1;                          update solution
        z_Ind := z_Ind + 1;                               counter to update z
    end if;
    while (z_Ind < my_Iter + 1) wait;                     wait till solution is updated
    my_Iter := my_Iter + 1;
end while;
fail := ||h(z, t)|| > ε.

The array Y contains the evaluated polynomials of the polynomial system as defined by the
homotopy h(x, t) = 0 along with all partial derivatives as needed in the Jacobian matrix. The
evaluation of all polynomials as described in [10] occurs in two stages: first the values of
all monomials are stored in V and then we multiply with the coefficients to obtain Y. The
partitioning of the work load between the threads is such that no synchronization within the
procedures Monomial_Evaluation and Coefficient_Product is needed.
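
The two-stage evaluation (monomial values first, then products with the coefficients) can be sketched in Python as follows. The sketch makes several assumptions for illustration: the system has a common support stored as a list of exponent tuples, the coefficients form one row per polynomial, derivatives are omitted, and the loop over my_id emulates sequentially what the threads would do concurrently. The cyclic assignment of monomials and polynomials to threads is one possible partition, not necessarily the one used in [10].

```python
def evaluate(support, coefficients, z, num_threads=2):
    """Return (V, Y): the monomial values and the values of the polynomials."""
    # Stage 1: every thread fills its own entries of V.
    V = [0.0] * len(support)
    for my_id in range(num_threads):
        for j in range(my_id, len(support), num_threads):
            value = 1.0
            for zi, e in zip(z, support[j]):
                value *= zi ** e
            V[j] = value
    # Stage 2: every thread fills its own entries of Y, reading all of V.
    Y = [0.0] * len(coefficients)
    for my_id in range(num_threads):
        for r in range(my_id, len(coefficients), num_threads):
            Y[r] = sum(c * v for c, v in zip(coefficients[r], V))
    return V, Y


# Two polynomials in (x, y): x^2 - 1 and x*y - 2, with support {x^2, x*y, 1}.
support = [(2, 0), (1, 1), (0, 0)]
coefficients = [[1.0, 0.0, -1.0], [0.0, 1.0, -2.0]]
V, Y = evaluate(support, coefficients, (3.0, 4.0))
```

Within a stage no two threads write to the same entry, so the only synchronization needed is between the two stages: all of V must be complete before any thread starts on Y.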
The array Y then contains all the information needed to set up the linear system Ax = b. The
row reduction with pivoting is performed on the augmented matrix [A b], denoted in the algorithm
by Ab. For numerical stability, we apply pivoting in the routine GE of Algorithm 4.1. The pivoting
implies that within the procedure GE synchronization is needed. For synchronization in GE, we
follow the same protocol: the first thread selects the pivot element (the largest number in the
current column). After the selection of the pivot row, all threads can update their preassigned
part of the matrix. The assignment of rows relates the row number to the identification number
of the thread. The selection of the pivot row must wait till all threads have finished modifying
their rows.

The output of the procedure GE is passed to the back substitution procedure Back_Subs. As
the back substitution solves a triangular system, synchronization inside the routine is necessary
for correct results.
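
For reference, a sequential Python sketch of row reduction with partial pivoting on the augmented matrix, followed by back substitution, is given below. It shows the arithmetic only; the function name solve and the list-of-rows storage are assumptions of this illustration, and the distribution of rows over threads is indicated in comments rather than implemented.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A b].
    Ab = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Pivot selection: largest entry in the current column. In the
        # multithreaded version this is done by one thread, after all
        # threads have finished modifying their rows.
        pivot = max(range(col, n), key=lambda r: abs(Ab[r][col]))
        if Ab[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        Ab[col], Ab[pivot] = Ab[pivot], Ab[col]
        # Row updates: the rows below the pivot row are independent of each
        # other, so they can be divided among the threads.
        for r in range(col + 1, n):
            factor = Ab[r][col] / Ab[col][col]
            for c in range(col, n + 1):
                Ab[r][c] -= factor * Ab[col][c]
    # Back substitution on the triangular system: x[r] needs x[r+1..n-1].
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(Ab[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (Ab[r][n] - s) / Ab[r][r]
    return x


x = solve([[0.0, 2.0], [1.0, 1.0]], [4.0, 3.0])
```

The example needs a row interchange because the first diagonal entry is zero; its solution is x = [1.0, 2.0].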

To project the speedup for path tracking a system, we generate a system of 40 variables with a
common support of 200 monomials. Every monomial has degree 40 on average with 80 as largest
degree. We simulated 1,000 Newton iterations and results are reported in Table 1.

40-by-40 system, 1000 times

#threads   Pol.Ev.    Gauss.El.   Back Subs.   Total      speedup
   1       35.732s    4.849s      0.197s       40.778s    1
   2       17.932s    3.113s      0.100s       21.145s    1.928
   4        9.248s    1.824s      0.062s       11.134s    3.662
   8        4.775s    1.349s      0.053s        6.177s    6.602

Table 1: Elapsed wall clock time for increasing number of threads for polynomial evaluation,
Gaussian elimination and back substitution. The speedup is calculated for the total time.

For the generated problem the polynomial evaluation dominates the total cost. In Table 1
we see that once we reduce the cost of polynomial evaluation using 8 cores, its wall clock time
becomes less than the total time spent on Gaussian elimination with one core. While multicore
row reduction has a less favorable speedup compared to polynomial evaluation, we see that the
multithreaded version is beneficial for the total speedup.

5 Effect of a Quadratic Predictor 

Our multithreaded implementation achieves better speedups the larger the ratio of the
dimension of the system to the number of engaged cores. Thus we work with larger dimensions.
The secant predictor, which is efficient only for systems of smaller dimensions, becomes extremely
inefficient for larger dimensions. We need a predictor that better approximates the intricate
local behaviour of a curve in multidimensional space. A suitable option
proved to be the following quadratic predictor:

• Tracking a path, we keep approximate solutions $x^*_{\rm prev1}$, $x^*_{\rm prev}$, and $x^*$ of
intermediate systems, corresponding to the three most recent values
$t_{\rm prev1} < t_{\rm prev} < t$ of the homotopy parameter, respectively.

• For each index $i$, the coordinate $x^*[i]$ of the new initial guess $x^*$ for the solution on the
path of the intermediate system associated with the value ($t$ + current step size) of the
homotopy parameter is computed independently of the other coordinates as follows:

1. We interpolate the points $(t_{\rm prev1}, x^*_{\rm prev1}[i])$, $(t_{\rm prev}, x^*_{\rm prev}[i])$, and $(t, x^*[i])$ by a parabola.

2. The new $x^*[i]$ is then the value of this parabola at the point ($t$ + current step size).
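
The two steps above can be sketched in Python with the Lagrange form of the interpolating parabola. The function name and argument order are chosen for this illustration; the weights depend only on the values of t, so they are computed once and shared by all coordinates.

```python
def quadratic_predict(t_prev1, x_prev1, t_prev, x_prev, t, x, t_new):
    """Predict the solution at t_new from the three most recent points.

    x_prev1, x_prev, x are sequences of coordinates (real or complex) at the
    parameter values t_prev1 < t_prev < t.
    """
    # Lagrange weights of the parabola through the three points, at t_new.
    w_prev1 = (t_new - t_prev) * (t_new - t) / ((t_prev1 - t_prev) * (t_prev1 - t))
    w_prev = (t_new - t_prev1) * (t_new - t) / ((t_prev - t_prev1) * (t_prev - t))
    w = (t_new - t_prev1) * (t_new - t_prev) / ((t - t_prev1) * (t - t_prev))
    # Each coordinate is extrapolated independently of the others.
    return [w_prev1 * a + w_prev * b + w * c
            for a, b, c in zip(x_prev1, x_prev, x)]


# Points on the curve (t^2, 2t + 1) at t = 0, 1, 2; prediction at t = 3.
guess = quadratic_predict(0.0, [0.0, 1.0], 1.0, [1.0, 3.0], 2.0, [4.0, 5.0], 3.0)
```

Because the test curve is itself quadratic in t, the prediction reproduces it exactly: guess equals [9.0, 7.0].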

Since each coordinate of such a guess for a new intermediate system is computed independently
of all the others, and all it requires is computing one value of the parabola interpolating three
points, which takes a fixed finite number of arithmetic operations, the complexity
of this predictor depends linearly on the dimension of the system. Thus the share of the quadratic
predictor in the entire path tracker computation is negligible. There is therefore no reason
to multithread the quadratic predictor, although it could apparently be done effectively with
minimal effort. On the other hand, despite its very low computational cost, using the
quadratic predictor described above instead of the secant predictor brings a dramatic reduction
in the number of corrections needed to track a path. In our experiments we tracked on 8 cores a
solution path for a system of dimension 20, with 20 monomials in each polynomial, each
monomial of maximal degree 2, using both predictors. With the secant predictor, it required
113623 successful corrections, with a running time of 26m53.008s, minimal step size 6.10352e-07, and
average step size 9.0404e-06. With the quadratic predictor, it required 572 successful
corrections, with a running time of 8.863s, minimal step size 0.00016, and average step size 0.00019.
In particular, the running time with the quadratic predictor was about 180 times shorter than with
the secant predictor. In all our other numerous experiments with systems of various sufficiently
large dimensions, the gain from using the quadratic predictor remained of the same order. For a
system of dimension 40, a run on 8 cores with the quadratic predictor may take several minutes
while a run with the secant predictor may take several days.

The quadratic predictor provides a very suitable balance between its low computational
complexity and the reduction in the number of corrections it brings, thus ensuring a considerable
gain, probably one of the best possible, in absolute running time when tracking a path. The tables
below show timings that illustrate the beneficial effect of using a quadratic predictor. On systems
of the same dimension and degrees, Table 2 shows experimental results of runs with a secant
predictor. Comparing the data of Table 2 with Table 3 we observe significant differences in the
average and minimal step sizes along a path. Timings in Tables 2 and 3 are for runs on eight
cores.



6 Faster Evaluation of Polynomials and their Derivatives 

In this section, we describe our application of techniques of algorithmic differentiation [4].

Along with each normalized monomial $x_{i_1}^{a_1} x_{i_2}^{a_2} \cdots x_{i_k}^{a_k}$ with
$1 \leq i_1 < i_2 < \cdots < i_k \leq n$ and $a_1, \ldots, a_k \geq 1$, which appears in the original
homotopy $H(x,t) = 0$, there appear associated to it the monomials
$x_{i_1}^{a_1-1} x_{i_2}^{a_2} \cdots x_{i_k}^{a_k}$, $x_{i_1}^{a_1} x_{i_2}^{a_2-1} \cdots x_{i_k}^{a_k}$, ...,
$x_{i_1}^{a_1} x_{i_2}^{a_2} \cdots x_{i_k}^{a_k-1}$ in
$\frac{\partial H}{\partial x_{i_1}}$, $\frac{\partial H}{\partial x_{i_2}}$, ..., $\frac{\partial H}{\partial x_{i_k}}$
respectively. Employing the fact that the exponents of the monomial partial derivatives do not
differ much from the exponents of the original monomial, we compute the values of the original
monomial and of the $k$ monomials associated to it in the partial derivatives as follows:

20-by-20 systems, monomials of degree 10, secant predictor

            #succ. corrs   #corrs      time           avg step   min step
Syst. 1     141244         142657      3h55m3.559s    7.28E-06   1.22E-06
Syst. 2     176512         178274      8h34m5.082s    5.83E-06   3.05E-07
Syst. 3     150112         151612      7h5m29.649s    6.85E-06   3.05E-07
Syst. 4     125231         126483      7h26m11.352s   8.21E-06   1.22E-06
Syst. 5     187772         189645      9h19m29.869s   5.48E-06   3.05E-07
Average     156174.2       157734.2    7h16m7.900s    6.73E-06   6.71E-07
St. Dev.    25637.81       25892.2     2h4m31.303s    1.11E-06   5.01E-07

Table 2: For five differently generated systems, we respectively report the number of successful
corrector stages, the total number of corrections, the total time, the average and minimal step
size along a solution path, using a secant predictor.

20-by-20 systems, monomials of degree 10, quadratic predictor

            #succ. corrs   #corrs      time           avg step   min step
Syst. 1     571            624         0m59.552s      1.91E-03   3.13E-04
Syst. 2     791            864         2m34.505s      1.37E-03   7.81E-05
Syst. 3     668            730         2m4.725s       1.62E-03   3.91E-05
Syst. 4     528            578         1m39.336s      2.06E-03   1.56E-04
Syst. 5     848            924         2m39.015s      1.27E-03   7.81E-05
Average     681.2          744         1m59.427s      1.65E-03   1.33E-04
St. Dev.    137.538        149.124     0m41.275s      3.39E-04   1.09E-04

Table 3: For five differently generated systems, we respectively report the number of successful
corrector stages, the total number of corrections, the total time, the average and minimal step
size along a solution path, using a quadratic predictor.



1. We first compute the common factor $x_{i_1}^{a_1-1} x_{i_2}^{a_2-1} \cdots x_{i_k}^{a_k-1}$ of the monomial and its
derivatives.

2. We multiply the value of the common factor by $\omega_m = \prod_{j=1, j \neq m}^{k} x_{i_j}$, $m = 1, 2, \ldots, k$, to get the
values of the monomial partial derivatives.

3. We multiply $x_{i_1}^{a_1-1} x_{i_2}^{a_2} \cdots x_{i_k}^{a_k}$ by $x_{i_1}$ to obtain the original monomial.

The products $\omega_m$, for $m = 1, 2, \ldots, k$, we obtain in $3k - 6$ multiplications in the following fashion:

1. Recursively we get all products $\psi_m = x_{i_1} x_{i_2} \cdots x_{i_m}$, $m = 1, 2, \ldots, k-1$, by
$\psi_m = \psi_{m-1} x_{i_m}$, $\psi_1 = x_{i_1}$.

2. Similarly we obtain all products $\varphi_m = x_{i_k} x_{i_{k-1}} \cdots x_{i_{k-m+1}}$, $m = 1, 2, \ldots, k-1$, by
$\varphi_m = \varphi_{m-1} x_{i_{k-m+1}}$, $\varphi_1 = x_{i_k}$.

3. Finally we get the products $\omega_m$, $m = 1, 2, \ldots, k$, as $\omega_1 = \varphi_{k-1}$, $\omega_k = \psi_{k-1}$,
$\omega_m = \psi_{m-1} \varphi_{k-m}$, $m = 2, \ldots, k-1$.

In [3], the evaluation of all derivatives of a product of variables is known as Speelpenning's 
example. Because we assume that our polynomials are sparse, we may focus on the individual 
monomials. For dense polynomials, a nested Horner scheme would be more appropriate. 
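
The scheme of this section can be sketched in Python as follows. The sketch is an illustration under assumptions: indices start at 0, the function names are hypothetical, and the gradient includes the exponents as factors (the normalized monomials of the text leave them to the coefficients).

```python
def products_omitting_one(x):
    """Return omega with omega[m] the product of all x[j], j != m.

    Uses forward and backward partial products, 3k - 6 multiplications
    for k = len(x) >= 2 variables.
    """
    k = len(x)
    if k == 1:
        return [1.0]
    if k == 2:
        return [x[1], x[0]]
    forward = [x[0]]                   # forward[m] = x[0] * ... * x[m]
    for m in range(1, k - 1):
        forward.append(forward[-1] * x[m])
    backward = [x[k - 1]]              # backward[m] = x[k-1] * ... * x[k-1-m]
    for m in range(1, k - 1):
        backward.append(backward[-1] * x[k - 1 - m])
    omega = [backward[k - 2]]          # all variables except the first
    for m in range(1, k - 1):
        omega.append(forward[m - 1] * backward[k - 2 - m])
    omega.append(forward[k - 2])       # all variables except the last
    return omega


def monomial_and_gradient(x, a):
    """Value and gradient of the monomial prod x[j]**a[j], all a[j] >= 1."""
    common = 1.0                       # common factor prod x[j]**(a[j] - 1)
    for xj, aj in zip(x, a):
        common *= xj ** (aj - 1)
    omega = products_omitting_one(x)
    value = common * omega[0] * x[0]   # derivative monomial times x[0]
    gradient = [aj * common * w for aj, w in zip(a, omega)]
    return value, gradient


# The monomial x0^2 * x1 * x2^3 at (2, 3, 5).
value, gradient = monomial_and_gradient([2.0, 3.0, 5.0], [2, 1, 3])
```

For this example the value is 1500.0 and the gradient is [1500.0, 500.0, 900.0], in agreement with differentiating the monomial by hand.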

7 Computational Experiments 

The code was developed on a Mac OS X computer with two 3.2 GHz quad core Intel Xeon
processors. For multithreading, we use the standard pthreads library and QD-2.3.9 for the quad
double arithmetic.

In Table 4 we list one generated example for the complete integrated multithreaded version
of the path tracker, for a system of dimension 40, once with polynomials of degree 2 and once
with polynomials of degree 20. In the latter case, we get a close to optimal speedup and also
for quadratic polynomials, the speedup is acceptable. Comparing with Table 5 we see that the
degrees of the polynomials are the determining factor in achieving a good speedup.

Dim=40, 40 monomials of degree 2 in a polynomial

#threads   real           user           sys          speedup
   1       5m25.509s      5m25.240s      0m0.254s     1
   2       2m54.098s      5m47.506s      0m0.186s     1.870
   4       1m38.316s      6m31.580s      0m0.206s     3.312
   8       1m 2.257s      8m11.130s      0m0.352s     5.226

Dim=40, 40 monomials of degree 20 in a polynomial

#threads   real           user           sys          speedup
   1       244m55.691s    244m48.501s    0m 6.621s    1
   2       123m 1.536s    245m53.987s    0m 3.838s    1.991
   4       61m53.447s     247m14.921s    0m 4.181s    3.958
   8       32m22.671s     256m27.142s    0m11.541s    7.567

Table 4: Elapsed real, user, system time, and speedup for tracking one path in complex quad
double arithmetic on a system of dimension 40, once with quadrics, and once with polynomials
of degree 20.

The results in Tables 4 and 5 were obtained without the faster evaluation of polynomials and
their derivatives, so the degrees matter most in the speedup. With faster evaluation routines, the
threshold on the dimension for the speedup will have to be higher.

We end this paper with some preliminary sequential timings on using faster evaluation and
differentiation schemes. In our previous implementation, the continuation parameter t was treated
as just another variable and this led to an overhead of a factor 3. Table 6 contains experimental
results.

Dim=20, 20 monomials of degree 2 in a polynomial

#threads   real           user           sys          speedup
   1       0m37.853s      0m37.795s      0m0.037s     1
   2       0m21.094s      0m42.011s      0m0.063s     1.794
   4       0m12.804s      0m50.812s      0m0.061s     2.956
   8       0m 8.721s      1m 8.646s      0m0.097s     4.340

Dim=20, 20 monomials of degree 10 in a polynomial

#threads   real           user           sys          speedup
   1       7m17.758s      7m17.617s      0m0.123s     1
   2       3m42.742s      7m24.813s      0m0.206s     1.965
   4       1m53.972s      7m34.386s      0m0.150s     3.841
   8       0m59.742s      7m53.469s      0m0.279s     7.327

Table 5: Elapsed real, user, system time, and speedup for tracking one path in complex quad
double arithmetic on a system of dimension 20, once with quadrics, and once with polynomials
of degree 10.

8 Conclusions 

For polynomial systems where the computational cost is dominated by the evaluation of
polynomials, the multithreaded version of our path tracker already performs well in modest dimensions.
For systems of lower degrees, the threshold dimension for good speedup will need to be higher
because then the cost of Gaussian elimination becomes more important.

20-by-20 systems, polynomial evaluation, 400 times, sequential execution

degrees   new alg. time   old alg. time   speedup   pure speedup
   5      4.267s          36.521s         8.59      2.85
  10      7.280s          1m7.114s        9.22      3.07
  20      11.304s         2m3.122s        10.89     3.63

40-by-40 systems, polynomial evaluation, 400 times, sequential execution

degrees   new alg. time   old alg. time   speedup   pure speedup
   5      19.855s         4m58.162s       15.02     5.01
  10      36.737s         8m48.765s       14.39     4.80
  20      1m1.980s        16m4.541s       15.56     5.19

Table 6: For a system of dimension 20 and 40, for increasing degrees, we list the times of the new
algorithm, the old algorithm and the speedup. The pure speedup is the speedup divided by 3, to
account for treating the continuation parameter t differently.